47 research outputs found

    Arabic/Latin and Machine-printed/Handwritten Word Discrimination using HOG-based Shape Descriptor

    In this paper, we present an approach for identifying Arabic and Latin scripts and their type (machine-printed or handwritten) based on Histogram of Oriented Gradients (HOG) descriptors. HOGs are first applied at word level based on writing orientation analysis. Then, they are extended to word image partitions to capture fine and discriminative details. Pyramid HOGs are also used to study their effect at different observation levels of the image. Finally, co-occurrence matrices of HOG are computed to capture spatial information between pairs of pixels, which basic HOG does not take into account. A genetic algorithm is applied to select the potentially informative feature combinations that maximize classification accuracy. The output is a relatively short descriptor that provides an effective input to a Bayes-based classifier. Experimental results on a set of words extracted from standard databases show that our identification system is robust and provides good word script and type identification: 99.07% of words are correctly classified.
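
    As a rough illustration of the word-level and partition-level descriptors mentioned in this abstract, the sketch below builds a magnitude-weighted histogram of unsigned gradient orientations and concatenates it over vertical strips of a word image. The function names, the nine-bin layout, the three-strip split and the border handling are all assumptions made for illustration; the paper's pyramid, co-occurrence and genetic feature-selection steps are not reproduced.

```python
import math


def orientation_histogram(image, n_bins=9):
    """L2-normalised histogram of unsigned gradient orientations (0-180 degrees).

    `image` is a list of equal-length rows of grey levels (hypothetical input format).
    """
    rows, cols = len(image), len(image[0])
    hist = [0.0] * n_bins
    bin_width = 180.0 / n_bins
    for y in range(rows):
        for x in range(cols):
            # Differences between neighbouring pixels, clamped at the image border.
            gx = image[y][min(x + 1, cols - 1)] - image[y][max(x - 1, 0)]
            gy = image[min(y + 1, rows - 1)][x] - image[max(y - 1, 0)][x]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0.0:
                continue
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            # Each pixel votes for its orientation bin, weighted by gradient magnitude.
            hist[min(int(angle / bin_width), n_bins - 1)] += magnitude
    norm = math.sqrt(sum(v * v for v in hist))
    return [v / norm for v in hist] if norm > 0 else hist


def partitioned_descriptor(image, n_parts=3, n_bins=9):
    """Concatenate one histogram per vertical strip (assumes cols >= n_parts)."""
    cols = len(image[0])
    bounds = [round(i * cols / n_parts) for i in range(n_parts + 1)]
    descriptor = []
    for left, right in zip(bounds, bounds[1:]):
        strip = [row[left:right] for row in image]
        descriptor.extend(orientation_histogram(strip, n_bins))
    return descriptor


# Toy "word image": a single vertical edge between a dark and a bright half.
toy_image = [[0] * 6 + [1] * 6 for _ in range(8)]
word_level = orientation_histogram(toy_image)
strip_level = partitioned_descriptor(toy_image)
```

    On this toy image the gradient is purely horizontal, so all the weight falls into the first orientation bin. A real system would typically rely on a tested image-processing library rather than this hand-written loop.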

    Inexact graph matching for entity recognition in OCRed documents

    This paper proposes an entity recognition system for document images recognized by OCR. The system is based on a graph matching technique and is guided by a database describing the entities in its records. The input of the system is a document labeled with the entity attributes. A first grouping of those labels based on a score function yields a set of candidate entities. Entity labels that are locally close are modeled by a structure graph. This graph is matched with model graphs learned for this purpose. The graph matching technique relies on a specific cost function that integrates the feature dissimilarities. The matching results are used to correct mislabeling errors and then validate the entity recognition task. The system evaluation on three datasets covering different kinds of entities shows recall between 88.3% and 95% and precision between 94.3% and 95.7%.

    Rapport Evaluation des OCR

    Nowadays we use a great many paper documents (administrative documents, reports, advertising...). The substantial growth of computing has created the need to digitize the information contained in these documents so that it can be classified and analyzed. Today there is a wide variety of systems with well-defined goals, such as segmenting handwritten lines into words or recognizing words. A document recognition system aims to transform a physical document into a digital one. For example, one may want to search the content of these documents or process the information they contain, hence the value of extracting their content. For printed document images, one category of systems extracts both content and layout: OCR (Optical Character Recognition) systems. As these systems developed, the need to evaluate them arose. Evaluation raises three questions: Which aspects can be evaluated? What constraints must be respected? How should these aspects be evaluated? The first section covers the aspects of OCR systems that can be evaluated, and the second covers the constraints imposed. The rest of the report then attempts, after a review of the state of the art, to answer the third question: how to evaluate them.

    Recognition-based Approach of Numeral Extraction in Handwritten Chemistry Documents using Contextual Knowledge

    This paper presents a complete procedure that uses contextual and syntactic information to identify and recognize amount fields in the table regions of chemistry documents. The proposed method is composed of two main modules. First, a structural analysis based on connected component (CC) dimensions and positions identifies some special symbols and clusters the other CCs into three groups: fragments of characters, isolated characters, or connected characters. Then, specific processing is performed on each group of CCs. The character fragments are merged with the nearest character or string using rules based on geometric relationships. The isolated characters are sent to a recognition module to identify the numeral components. For the connected characters, the final decision on the nature of the string (numeric or non-numeric) is based on a global score computed over the full string using the height regularity property and the recognition probabilities of its segmented fragments. Finally, a simple syntactic verification at table row level is conducted to correct any remaining errors. The experimental tests are carried out on real-world chemistry documents provided by our industrial partner eNovalys. The results show the effectiveness of the proposed system in extracting amount fields.

    Impact of Features and Classifiers Combinations on the Performances of Arabic Recognition Systems

    Arabic recognition is a very challenging task that is beginning to draw the attention of the OCR community. This work presents our latest contributions to this task, exploring the impact of several combinations of features and classifiers on the performance of the systems we developed. Different types of writing were considered (machine-printed, multi-font, handwritten, unconstrained, multi-writer, bi-dimensional, large vocabulary, ancient manuscripts). For each type of writing, we considered the most appropriate features and classifiers: contextual primitives to compensate for Arabic morphological variation, statistical features to recognize mathematical symbols, and spectral features, mainly run-length histogram-based features and histogram of oriented gradients-based descriptors, to discriminate between machine-printed/handwritten and Arabic/Latin words. We also used the shape context descriptor for touching-character segmentation, which proved useful for training the models in the template-based recognition system. We took advantage of the generalized Hough transform to spot separator words in ancient Arabic manuscripts. In addition, Bayesian networks are used to handle writing uncertainty, and transparent neural networks to exploit the morphological aspect of the Arabic language and integrate linguistic knowledge into the recognition process. The proposed systems are designed based on the characteristics, similarities and differences of Arabic writings.

    Segmentation de documents composites par une technique de recouvrement des espaces blancs

    We present here a method for the segmentation of composite documents. Unlike most publications, we focus on non-Manhattan layouts, which are usually created by compositing. The pages to be processed therefore contain several sub-documents that have to be isolated. We draw on the white space cover technique introduced by Baird et al., together with a suite of pre- and post-processing steps specific to these particular documents. The evaluations are made on administrative records coming from various sources and provided to us by our industrial partner. As we do not have any ground-truth documents, we compared our results with those obtained by commercial OCR software, which our method outperforms.

    ZoneMapAlt: An alternative to the ZoneMap metric for zone segmentation and classification

    This paper proposes a new evaluation metric based on the existing ZoneMap metric. The ZoneMap method, designed to evaluate zone segmentation and classification, is considered in the context of OCR evaluation. Its limits are identified and described, and a new algorithm, ZoneMapAlt (ZoneMap Alternative), is proposed to address them while keeping the properties of the original. To validate the new metric, experiments were conducted on a dataset of scientific articles. Results demonstrate that the ZoneMapAlt algorithm provides greater detail on segmentation errors and is able to detect critical segmentation errors.

    Metrics for Complete Evaluation of OCR Performance

    In this paper, we study metrics for evaluating OCR performance both in terms of physical segmentation and in terms of textual content recognition. These metrics rely on the input formats of the OCR output (hypothesis) and the reference (also called ground truth). Two evaluation criteria are considered: the quality of segmentation and the character recognition rate. Three pairs of input formats are selected among two types of inputs: text only (text) and text with spatial information (xml). These reference-to-hypothesis input pairs are: 1) text-to-text, 2) xml-to-xml and 3) text-to-xml. For the text-to-text pair, we selected the RETAS method to perform experiments and show its limits. Regarding text-to-xml, a new method based on unique word anchors is proposed to solve the problem of aligning texts with different information. We define the ZoneMapAltCnt metric for the xml-to-xml approach and show that it offers the most reliable and complete evaluation compared to the other two. Open source OCR engines such as Tesseract and OCRopus are selected to perform experiments. The datasets used are a collection of documents from the ISTEX document database and from the French newspaper "Le Nouvel Observateur", as well as invoices and administrative documents gathered from different collaborations.
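
    To make the "character recognition rate" criterion concrete for the text-to-text case, here is a minimal sketch that scores a hypothesis against a reference using Levenshtein edit distance. It is a generic baseline written for illustration, not the RETAS method, the word-anchor alignment or the ZoneMapAltCnt metric named in the abstract; the function names and the sample strings are assumptions.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def character_accuracy(reference, hypothesis):
    """1 - (edit distance / reference length); can be negative for very noisy output."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return 1.0 - edit_distance(reference, hypothesis) / len(reference)


# Hypothetical OCR output with one substituted character ("o" read as "0").
accuracy = character_accuracy("invoice", "inv0ice")
```

    A score of this kind only reflects textual content; it says nothing about segmentation quality, which is why the abstract treats the two criteria separately.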

    Reconnaissance de formules mathématiques Arabes par un système dirigé par la syntaxe

    This contribution presents a syntax-directed system that recognizes Arabic mathematical formulas and returns the recognition results in MathML format. A set of replacement rules is defined by a coordinate grammar to analyze Arabic mathematical formulas. This grammar is applied by relying on symbol recognition and the analysis of the symbols' spatial arrangement. We used k-nearest neighbors to recognize Arabic mathematical symbols, and a combined top-down and bottom-up parser based on operator dominance to recursively divide the formula into simpler sub-formulas. In the proposed system, the symbol recognition and structural analysis modules interact closely. It is thus possible to use structural information to help resolve ambiguous or confused symbols. This syntax-directed recognition system has been successfully demonstrated on several types of formulas found in various Arabic scientific documents.
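
    The symbol recognition step relies on k-nearest neighbours. A minimal sketch of that classifier is given below, assuming each symbol has already been reduced to a numeric feature vector; the two-dimensional features, the labels and the choice of Euclidean distance are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def knn_classify(samples, query, k=3):
    """Return the majority label among the k training samples closest to `query`.

    `samples` is a list of (feature_vector, label) pairs.
    """
    if not 1 <= k <= len(samples):
        raise ValueError("k must be between 1 and the number of samples")
    nearest = sorted(samples, key=lambda sample: math.dist(sample[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training set: two symbol classes described by two features each.
training = [
    ((0.0, 0.0), "plus"),
    ((0.1, 0.2), "plus"),
    ((0.2, 0.1), "plus"),
    ((1.0, 1.0), "minus"),
    ((0.9, 1.1), "minus"),
]
predicted = knn_classify(training, (0.95, 1.0), k=3)
```

    In a system like the one described, structural information from the parser could then be used to revisit a low-confidence vote, but that interaction is beyond this sketch.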

    Effects of treated wastewater irrigation on soil salinity and sodicity in Sfax (Tunisia): A case study

    In arid regions such as the area around Sfax (Tunisia), treated wastewater effluents (TWE) are often used for agricultural irrigation. TWE used for irrigation usually contain large amounts of carbon, nitrogen and sodium. The objective of this study was to evaluate the impact of TWE irrigation on soil salinity and sodicity. In the city of Sfax, two sites were selected with two soil types (fluvisol and calcisol) that had been irrigated for 4 and 15 years respectively. Soils were sampled at three different depths (0-30, 30-60 and 60-90 cm) in the TWE-irrigated area and in a non-irrigated control area. Irrigated and non-irrigated study soils were analyzed for pH, nitrate and ammonia, electrical conductivity (ECs), exchangeable sodium percentage (ESP), sodium adsorption ratio (SAR) and soil organic matter. The fluvisol, irrigated for only four years, is more affected by salinity than the calcisol irrigated for 15 years. In the upper fluvisol layer irrigated with treated wastewater, ECs reaches 8 mS·cm⁻¹ and ESP reaches 15%, while in all layers of the calcisol, ECs and ESP are lower and rarely exceed 4 mS·cm⁻¹ and 6% respectively. This result is due to a combination of factors in the fluvisol treatment area, including texture, cation exchange capacity, irrigation procedure and crop management.
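
    The two sodicity indicators named in this abstract follow standard textbook definitions, sketched below: SAR is computed from soluble cation concentrations in meq/L, and ESP is the share of the cation exchange capacity occupied by sodium. The function names and the sample values are assumptions for illustration; the abstract does not report the underlying measurements.

```python
import math


def sodium_adsorption_ratio(na, ca, mg):
    """SAR = Na / sqrt((Ca + Mg) / 2), with all concentrations in meq/L."""
    if ca + mg <= 0:
        raise ValueError("Ca + Mg must be positive")
    return na / math.sqrt((ca + mg) / 2.0)


def exchangeable_sodium_percentage(exchangeable_na, cec):
    """ESP (%) = 100 * exchangeable Na / CEC, both in the same unit (e.g. cmol(+)/kg)."""
    if cec <= 0:
        raise ValueError("CEC must be positive")
    return 100.0 * exchangeable_na / cec


# Hypothetical sample values, not data from the study.
sar = sodium_adsorption_ratio(na=10.0, ca=4.0, mg=4.0)
esp = exchangeable_sodium_percentage(exchangeable_na=3.0, cec=20.0)
```

    With these made-up inputs the ESP comes out at 15%, the same level the abstract reports for the upper fluvisol layer.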